The objective of this project is to predicts the price and returns of crude oil front month futures contracts. To this end, I select a cost function, build multiple models to identify parameters, use the models to makes predictions in the testing data, and then I use the cost function to rank the models.
Structure of report:
Selection of a cost function
Overview of the data
Preliminary visualizations
Preliminary models
Predictions in the testing data
Ranking models / selection of best model
Final visualizations
Assumptions used to identify a cost function:
Traders are risk-adverse: the larger the misprediction, the greater the cost.
Traders can take both short or long positions. Therefore, the cost function should penalize mispredicted positive or negative price changes equally.
Given the above assumptions, the criteria for the cost function is that it should be exponential and symmetric. For this reason, I select the standard average of Squared Errors as the cost function.
I have collected the data from two sources:
The Energy Information Administration website. I have merged and organized this data in a seperate Rmd file. I called this dataset ‘Fundamentals’ and I have loaded it below.
The Wharton Research Data Services (WRDS) data. I accessed this data from the WRDS cloud via a remote R connection. After merging and cleaning this data in a seperate Rmd file, I have called it ‘Econ Ind’ and I have loaded it below.
After loading the data, I merged the data and made further adjustments. One adjustement of relevance relates to the treatment of ‘NA’s. While the price data is available on a daily basis, other variables were available on a weekly, monthly, or yearly basis only. In this case, I filled the ’NA’ observations with the average value for the time period. It is worth noting that a better approach would have been to estimate the missing data points using available data and seasonality patterns.
I use data from 2003-12-01 to 2015-11-17 for the analysis, but show some preliminary plots of earlier data from 1983.
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
My dependent variable is daily return on the price of crude oil fronth month active contracts traded in New York Mercantile Exchange (NYMEX). I will also show predictions and visualizations of the price of the contract, though my central aim is to predict the returns.
The data available from EIA website on the price of active crude oil contracts only shows one price per day. Therefore, the returns that I calculate are inter -day returns. It would have been better to obtain opening and closing prices and calculate intra -day returns, but this data was not available on the EIA website. (Note however: the data is available on Columbia’s Bloomber terminals).
Note: Futures contracts prices reflect expectations of the price of the underlying product (in this case, crude oil) at a future point in time (delivery date). Multiple contracts for different future time periods trade at the same time. Typically, the contract that is closer to expiry (called ‘front month’) exhibits the highest trading volume.
Here I calculate prior day price and the interday return:
df$n <- 1:nrow(df)
df$previous_price <- df$Price[ifelse(df$n==1,df$n, df$n-1)]
df$return <- (df$Price - df$previous_price)/ df$previous_price *100
df <- df[ , -17]
df <- df[, c(18,2,1,3:17)]
The variables obtained from the above sources are described below. I also contruct additional variables.
comm_stocks Weekly U.S. Ending Stocks excluding SPR of Crude Oil (Thousand Barrels). Frequency of data: weekly. Source: EIA.
cost U.S. Nominal Cost per Foot of Crude Oil Wells Drilled (Dollars per Foot). Frequency of data: annual.
reserves U.S. Crude Oil Proved Reserves (Million Barrels). Frequency of data: annual.
spr U.S. Ending Stocks of Crude Oil in Strategic Petroleum Reserves (SPR) (Thousand Barrels). Frequency of data: weekly.
us_production U.S. Field Production of Crude Oil (Thousand Barrels per Day). Frequency of data: monthly.
consumption Consumption of Oil in the United States (’000 barrers per day), EIA
indstr_prod Industrial Production, Preliminary Series, SA.
cpi Consumer price index - percent change year ago period, NSA
unempl Unemployment rate (%), SA
fed.reserves Total reserves (mil us$) - end of period
ppi Producer Price Index: Depository Credit Intermediation, NSA
indstr.prod2 Industrial Production Index, SA
cad.usd.exch.bid Exchange rate (IDC) - London market close - Forward - BID - 30 day - CDN$/USD
cad.usd.exch.ask Exchange rate (IDC) - London market close - Forward - ASK - 30 day - CDN$/USD
I construct the following variables:
Days until contract expiry.
A Fourier series to model seasonality.
Days until contract expiry
I use the below definition of Contract 1 (a.k.a front month) to calculate the expiry date:
Contract Definition: A futures contract specifying the earliest delivery date. For crude oil, each contract expires on the third business day prior to the 25th calendar day of the month preceding the delivery month. If the 25th calendar day of the month is a non-business day, trading ceases on the third business day prior to the business day preceding the 25th calendar day. After a contract expires, Contract 1 for the remainder of that calendar month is the second following month.
| Day on the 25th | 3 business days before | Difference |
|---|---|---|
| Sunday | Tuesday | 5 days |
| Saturday | Tuesday | 4 days |
| Friday | Tuesday | 3 days |
| Thursday | Monday | 3 days |
| Wednesday | Friday | 5 days |
| Tuesday | Thursday | 5 days |
| Monday | Wednesday | 5 days |
Identify expiry date
date <- df$date
df2 <- data.frame(date)
df2$month <- as.numeric(format(df2$date, "%m"))
df2$year <- as.numeric(format(df2$date, "%Y"))
df2$day <- as.numeric(format(df2$date, "%d"))
df2$new_date <- as.Date(paste0(df2$year,"-",df2$month,"-25"))
df2$weekday <- as.numeric(format(df2$new_date, "%w"))
df2$days_to_subtr <- ifelse(df2$weekday <= 3, 5, ifelse(df2$weekday == 6, 4, 3)) # Note: Sunday corresponds to weekday 0.
df2$expiry1 <- df2$new_date - df2$days_to_subtr
# Check weekday of expiry day (should only be within 1 and 5)
summary(as.numeric(format(df2$expiry1, "%w")))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.697 4.000 5.000
# Expiry 2
df2$new_date2 <- as.Date(paste0(df2$year,"-",df2$month + 1,"-25"))
df2$weekday2 <- as.numeric(format(df2$new_date2, "%w"))
df2$days_to_subtr2 <- ifelse(df2$weekday2 <= 3, 5, ifelse(df2$weekday2 == 6, 4, 3)) # Note: Sunday corresponds to weekday 0.
df2$expiry2 <- df2$new_date2 - df2$days_to_subtr2
summary(as.numeric(format(df2$expiry2, "%w")))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.000 2.000 2.748 4.000 5.000 665
# Selecting correct expiry day
df2$comparison_day <- as.numeric(format(df2$expiry1, "%d"))
df2$expiration_date <- as.Date(ifelse(df2$day > df2$comparison_day, df2$expiry2, df2$expiry1))
df2$days_until_expiry <- as.numeric(df2$expiration_date-df2$date)
df$expiry_date <- df2$expiration_date
df$days_util_expiry <- df2$days_until_expiry
rm(df2)
rm(date)
df <- df[, c(1:3,19:20,4:18)]
Modelling seasonality
I use a fast Fourier series to model seasonality, as follows.
df$month <- as.numeric(format(df$date, "%m"))
df$sin1 <- sin(2*pi*df$month/12)
df$cos1 <- cos(2*pi*df$month/12)
df$sin2 <- sin(4*pi*df$month/12)
df$cos2 <- cos(4*pi*df$month/12)
df$sin3 <- sin(6*pi*df$month/12)
df$cos3 <- cos(6*pi*df$month/12)
df$sin4 <- sin(8*pi*df$month/12)
df$cos4 <- cos(8*pi*df$month/12)
df$sin5 <- sin(10*pi*df$month/12)
df$cos5 <- cos(10*pi*df$month/12)
Note: Due to some later conflicts between packages, I visualize a non-linear generalized additive model here (though it is more relevant later).
df2 <- df[complete.cases(df),]
training2 <- df2[df2$date < "2014-01-01",]
testing2 <- df2[df2$date >= "2014-01-01",]
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.12
gam1 <- gam::gam(return ~ s(date) + s(reserves) + s(comm_stocks) + s(spr) + s(us_production) + s(cost) + s(indstr_prod) + s(unempl) + s(cad_usd_exch_bid) + s(cpi) + s(reserves_fed) + s(ppi) + s(indstr_prod2) + s(month), data = training2)
par(mfcol = c(2,2))
plot(gam1, residuals = TRUE, pch =".", rugplot = FALSE)
## Warning in gplot.default(x = structure(c(12387, 12388, 12389, 12390,
## 12391, : The "x" component of "s(date)" has class "Date"; no gplot()
## methods available
par(mfcol = c(2,1))
Returns, Industrial production, and cost of drilling
library(ggplot2)
qplot(data = df, x = date, y = return, colour = cost, lwd = indstr_prod, xlab = "Time", ylab = "Crude Oil Active Future Contract ($ / Barrel)") + ggtitle("Returns, Industrial production, and cost of drilling") + geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 9 rows containing missing values (geom_point).
Price, Industrial production, and cost of drilling
qplot(data = df, x = date, y = Price, colour = cost, lwd = indstr_prod, xlab = "Time", ylab = "Crude Oil Active Future Contract ($ / Barrel)") + ggtitle("Price, Industrial production, and cost of drilling") + geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 9 rows containing missing values (geom_point).
chart_return <- function(x, xlab, y = df$return, ylab = "Returns"){
plot(y = y, x = x, xlab = xlab, ylab = ylab, pch = 1)
abline(lm(y ~ x), col = "blue", lwd = 1)
}
par(mfcol = c(2,2))
chart_return(df$expiry_date, "Expiry Day of Contract")
chart_return(df$days_util_expiry, "Number of days until expiry")
chart_return(df$reserves, "U.S. Crude Oil Proved Reserves (Million Barrels)")
chart_return(df$comm_stocks, "Ending Stocks excluding SPR of Crude Oil (Thousand Barrels)")
chart_return(df$unempl, "Unemployment rate (%)")
chart_return(df$cad_usd_exch_bid, "CDN$/USD exchange rate - 30 day BID")
chart_return(df$cad_usd_exch_ask, "CDN$/USD exchange rate - 30 day ASK")
chart_return(df$previous_price, "Previous day price")
chart_return(df$month, "Month")
chart_return(df$cos1, "First element of the seasonality pattern (cos1)")
chart <- function(x, xlab, y = df$Price, ylab = "Price"){
plot(y = y, x = x, xlab = xlab, ylab = ylab, pch = 1)
abline(lm(y ~ x), col = "blue", lwd = 1)
}
par(mfcol = c(2,2))
chart(df$expiry_date, "Expiry Date of the Contract")
chart(df$reserves, "U.S. Crude Oil Proved Reserves (Million Barrels)")
chart(df$us_production, "U.S. Field Production of Crude Oil (Thousand Barrels per Day)")
chart(df$spr, "U.S. Ending Stocks of Crude Oil in Strategic Petroleum Reserves (SPR) (Thousand Barrels)")
chart(df$comm_stocks, "Ending Stocks excluding SPR of Crude Oil (Thousand Barrels)")
chart(df$indstr_prod, "Industrial Production, Preliminary Series")
chart(df$cpi, "Consumer price index - percent change year ago period")
chart(df$ppi, "Producer Price Index")
df <- df[complete.cases(df),]
training <- df[df$date < "2014-01-01",]
testing <- df[df$date >= "2014-01-01",]
lm0 <- lm(return ~ date, training)
lm0b <- lm(Price ~ date, training)
library(stargazer)
##
## Please cite as:
##
## Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2. http://CRAN.R-project.org/package=stargazer
stargazer(lm0, lm0b, type = 'text')
##
## ============================================================
## Dependent variable:
## ----------------------------
## return Price
## (1) (2)
## ------------------------------------------------------------
## date -0.00004 0.015***
## (0.00004) (0.0003)
##
## Constant 0.674 -138.259***
## (0.634) (4.367)
##
## ------------------------------------------------------------
## Observations 2,466 2,466
## R2 0.0004 0.496
## Adjusted R2 -0.00003 0.496
## Residual Std. Error (df = 2464) 2.340 16.113
## F Statistic (df = 1; 2464) 0.928 2,428.500***
## ============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
p <- qplot(data = df, x = date, y = return, xlab = "Time", ylab = "Returns")
coef_lm0 <- coef(lm0)
p + geom_abline(intercept = coef_lm0[1], slope = coef_lm0[2], col = "green", lwd = 1)
p <- qplot(data = df, x = date, y = Price, xlab = "Time", ylab = "Price ($ / Barrel)")
coef_lm0b <- coef(lm0b)
p + geom_abline(intercept = coef_lm0b[1], slope = coef_lm0b[2], col = "green", lwd = 1)
lm1a <- lm(return ~ date + sin1 + cos1 + sin2 + cos2 + sin3 + cos3 + sin4 + cos4 + sin5 + cos5, training)
lm1b <- lm(Price ~ date + sin1 + cos1 + sin2 + cos2 + sin3 + cos3 + sin4 + cos4 + sin5 + cos5, training)
lm1a_subset <- step(lm1a, trace = FALSE)
lm1b_subset <- step(lm1b, trace = FALSE)
ggplot(data = training, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = lm1a_subset$fitted.values), color = "green")
ggplot(data = training, aes(x=date))+
geom_line(aes(y = Price)) +
geom_line(aes(y = lm1b_subset$fitted.values), color = "green")
lm2a <- lm(return ~ . + comm_stocks * cost + reserves * us_production - previous_price - Price, training)
lm2b <- lm(Price ~ . + comm_stocks * cost + reserves * us_production - previous_price - return, training)
# next command is needed for later
X_test <- model.matrix(return ~ . + comm_stocks * cost + reserves * us_production - previous_price - Price, data = testing)
lm2a_subset <- step(lm2a, trace = FALSE)
lm2b_subset <- step(lm2b, trace = FALSE)
ggplot(data = training, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = lm2a_subset$fitted.values), colour = "green" )
ggplot(data = training, aes(x=date))+
geom_line(aes(y = Price)) +
geom_line(aes(y = lm2b_subset$fitted.values), colour = "green" )
lm3a <- lm(return ~ . + comm_stocks * cost + reserves * us_production - Price, training[,-28])
lm3b <- lm(Price ~ . + comm_stocks * cost + reserves * us_production - return, training[,-28])
lm3a_subset <- step(lm3a, trace = FALSE)
lm3b_subset <- step(lm3b, trace = FALSE)
ggplot(data = training, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = lm3a_subset$fitted.values), colour = "green" )
ggplot(data = training, aes(x=date))+
geom_line(aes(y = Price)) +
geom_line(aes(y = lm3b_subset$fitted.values), colour = "green" )
stargazer(lm1a, lm2a, lm3a_subset, type = 'text')
##
## ===========================================================================================
## Dependent variable:
## --------------------------------------------------------------------
## return
## (1) (2) (3)
## -------------------------------------------------------------------------------------------
## date -0.00004 -0.003
## (0.00004) (0.005)
##
## expiry_date 0.003 0.001***
## (0.005) (0.0003)
##
## days_util_expiry
##
##
## reserves 0.0002 -0.0001***
## (0.0002) (0.00004)
##
## comm_stocks -0.00001 -0.00001**
## (0.00001) (0.00000)
##
## cost -0.005
## (0.008)
##
## us_production 0.002* 0.0004**
## (0.001) (0.0002)
##
## spr 0.00000
## (0.00001)
##
## consumption -0.001
## (0.001)
##
## indstr_prod -1.198
## (2.365)
##
## unempl -0.128 -0.105**
## (0.252) (0.050)
##
## cad_usd_exch_bid 901.135** 1,047.993***
## (421.755) (332.795)
##
## cad_usd_exch_ask -901.005** -1,052.777***
## (421.817) (332.929)
##
## cpi 0.015
## (0.077)
##
## reserves_fed -0.00000
## (0.00001)
##
## ppi 0.010
## (0.027)
##
## indstr_prod2 -2.068
## (6.625)
##
## previous_price -0.031***
## (0.006)
##
## month -0.047 -0.032**
## (0.098) (0.015)
##
## sin1 0.111* -0.122
## (0.066) (0.372)
##
## cos1 -0.063 0.004 -0.279***
## (0.068) (0.133) (0.094)
##
## sin2 0.069 0.011
## (0.066) (0.182)
##
## cos2 -0.031 -0.023
## (0.067) (0.117)
##
## sin3 -0.055 -0.072
## (0.066) (0.123)
##
## cos3 -0.057 -0.031
## (0.067) (0.116)
##
## sin4 0.018 0.010
## (0.066) (0.087)
##
## cos4 0.007 0.067
## (0.067) (0.121)
##
## sin5 -0.037 -0.052
## (0.066) (0.076)
##
## cos5 0.004 0.039
## (0.068) (0.120)
##
## comm_stocks:cost 0.00000
## (0.00000)
##
## reserves:us_production -0.00000
## (0.00000)
##
## Constant 0.578 7.765 3.251
## (0.637) (20.589) (2.717)
##
## -------------------------------------------------------------------------------------------
## Observations 2,466 2,466 2,466
## R2 0.003 0.010 0.015
## Adjusted R2 -0.001 -0.002 0.011
## Residual Std. Error 2.342 (df = 2454) 2.342 (df = 2436) 2.327 (df = 2455)
## F Statistic 0.688 (df = 11; 2454) 0.819 (df = 29; 2436) 3.812*** (df = 10; 2455)
## ===========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(lm1b, lm2b, lm3b_subset, type = 'text')
##
## ============================================================================================================
## Dependent variable:
## -------------------------------------------------------------------------------------
## Price
## (1) (2) (3)
## ------------------------------------------------------------------------------------------------------------
## date 0.015*** -0.029**
## (0.0003) (0.013)
##
## expiry_date 0.051*** 0.001***
## (0.013) (0.0002)
##
## days_util_expiry
##
##
## reserves -0.003*** -0.0001***
## (0.001) (0.00003)
##
## comm_stocks -0.0004*** -0.00001***
## (0.00004) (0.00000)
##
## cost -0.006
## (0.019)
##
## us_production -0.007*** 0.0003**
## (0.002) (0.0001)
##
## spr 0.0002***
## (0.00002)
##
## consumption -0.004*
## (0.002)
##
## indstr_prod -80.957***
## (5.655)
##
## unempl -4.914*** -0.095***
## (0.601) (0.037)
##
## cad_usd_exch_bid -1,432.331 793.812***
## (1,008.332) (243.457)
##
## cad_usd_exch_ask 1,307.920 -798.343***
## (1,008.481) (243.567)
##
## cpi 2.309***
## (0.183)
##
## reserves_fed -0.0001***
## (0.00002)
##
## ppi -0.701***
## (0.065)
##
## indstr_prod2 23.925
## (15.839)
##
## month 0.493**
## (0.234)
##
## previous_price 0.973***
## (0.005)
##
## sin1 -1.148** 3.578*** 0.139***
## (0.448) (0.890) (0.050)
##
## cos1 -4.151*** -8.714*** -0.252***
## (0.458) (0.317) (0.069)
##
## sin2 -0.087 -1.230***
## (0.449) (0.435)
##
## cos2 -0.924** 0.073
## (0.456) (0.280)
##
## sin3 -0.509 0.084
## (0.449) (0.295)
##
## cos3 0.339 -0.069
## (0.456) (0.278)
##
## sin4 0.481 0.017
## (0.449) (0.208)
##
## cos4 -0.460 -1.394***
## (0.456) (0.288)
##
## sin5 0.034 -0.304*
## (0.447) (0.183)
##
## cos5 -0.431 -1.459***
## (0.458) (0.286)
##
## comm_stocks:cost -0.00000
## (0.00000)
##
## reserves:us_production 0.00000***
## (0.00000)
##
## Constant -137.538*** 204.699*** 3.450*
## (4.310) (49.224) (1.979)
##
## ------------------------------------------------------------------------------------------------------------
## Observations 2,466 2,466 2,466
## R2 0.515 0.940 0.994
## Adjusted R2 0.513 0.939 0.994
## Residual Std. Error 15.841 (df = 2454) 5.600 (df = 2436) 1.695 (df = 2455)
## F Statistic 237.058*** (df = 11; 2454) 1,312.432*** (df = 29; 2436) 43,986.520*** (df = 10; 2455)
## ============================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
library(lars)
## Loaded lars 1.2
X <- model.matrix(lm2a)
y <- training$return
lasso <- lars(X, y, type = "lasso", trace = FALSE)
summary(lasso)
## LARS/LASSO
## Call: lars(x = X, y = y, type = "lasso", trace = FALSE)
## Df Rss Cp
## 0 1 13497 -5.2705
## 1 2 13482 -6.0312
## 2 3 13482 -4.0474
## 3 4 13481 -2.2668
## 4 5 13472 -1.8637
## 5 6 13470 -0.2218
## 6 7 13467 1.2280
## 7 8 13466 3.0293
## 8 7 13464 0.7490
## 9 8 13460 2.0209
## 10 9 13459 3.7815
## 11 10 13458 5.5076
## 12 11 13457 7.4539
## 13 12 13456 9.2522
## 14 13 13455 11.0860
## 15 14 13450 12.0888
## 16 15 13449 13.9066
## 17 16 13447 15.6459
## 18 17 13447 17.5506
## 19 16 13444 15.1290
## 20 15 13442 12.7693
## 21 16 13438 13.8750
## 22 17 13434 15.3038
## 23 18 13427 15.9567
## 24 19 13427 17.9010
## 25 20 13426 19.7976
## 26 21 13426 21.6871
## 27 22 13421 22.8243
## 28 23 13417 24.0832
## 29 22 13415 21.6957
## 30 23 13411 23.0974
## 31 24 13411 24.9766
## 32 25 13411 26.9721
## 33 24 13409 24.6861
## 34 25 13408 26.4994
## 35 24 13406 24.1946
## 36 23 13406 22.1245
## 37 24 13406 24.0631
## 38 25 13404 25.8297
## 39 26 13404 27.6834
## 40 27 13403 29.5760
## 41 28 13401 31.2114
## 42 27 13400 29.1130
## 43 26 13400 27.0515
## 44 27 13399 28.8030
## 45 28 13398 30.5835
## 46 27 13397 28.4922
## 47 28 13396 30.4007
## 48 27 13396 28.3553
## 49 28 13396 30.2570
## 50 29 13394 31.8967
## 51 28 13393 29.6892
## 52 29 13392 31.5768
## 53 30 13376 30.5763
## 54 29 13375 28.5029
## 55 30 13370 29.6135
## 56 29 13370 27.6073
## 57 30 13369 29.4012
## 58 29 13369 27.3823
## 59 30 13367 29.0000
## 60 31 13367 31.0000
## 61 30 13367 29.0000
## 62 31 13367 31.0000
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:stats':
##
## loadings
PCR <- pcr(return ~ . - Price - previous_price , data = training, validation = "LOO")
summary(PCR)
## Data: X dimension: 2466 28
## Y dimension: 2466 1
## Fit method: svdpc
## Number of components considered: 28
##
## VALIDATION: RMSEP
## Cross-validated using 2466 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 2.34 2.341 2.342 2.343 2.343 2.344 2.344
## adjCV 2.34 2.341 2.342 2.343 2.343 2.344 2.344
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 2.343 2.344 2.345 2.346 2.347 2.348 2.349
## adjCV 2.343 2.344 2.345 2.346 2.347 2.348 2.349
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 2.349 2.35 2.351 2.352 2.353 2.353
## adjCV 2.349 2.35 2.351 2.352 2.353 2.353
## 20 comps 21 comps 22 comps 23 comps 24 comps 25 comps
## CV 2.354 2.355 2.356 2.357 2.359 2.36
## adjCV 2.354 2.355 2.356 2.357 2.359 2.36
## 26 comps 27 comps 28 comps
## CV 2.36 2.359 4.683
## adjCV 2.36 2.359 2.834
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 74.977266 90.565969 99.88253 99.98656 99.9962 99.9997
## return 0.002893 0.004697 0.04626 0.09734 0.1644 0.2407
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 99.9999 100.000 100.0000 100.0000 100.0000 100.0000
## return 0.3287 0.365 0.3808 0.4049 0.4569 0.4584
## 13 comps 14 comps 15 comps 16 comps 17 comps 18 comps
## X 100.0000 100.000 100.0000 100.0000 100.0000 100.0000
## return 0.4619 0.489 0.5041 0.5042 0.5085 0.5119
## 19 comps 20 comps 21 comps 22 comps 23 comps 24 comps
## X 100.0000 100.0000 100.0000 100.0000 100.0000 100.0000
## return 0.6003 0.6003 0.6056 0.6062 0.6067 0.6116
## 25 comps 26 comps 27 comps 28 comps
## X 100.0000 100.0000 100.0000 100.0000
## return 0.6451 0.6723 0.8749 0.9095
detach("package:mgcv")
library(gam)
gam1 <- gam::gam(return ~ s(date) + s(reserves) + s(comm_stocks) + s(spr) + s(us_production) + s(cost) + s(indstr_prod) + s(unempl) + s(cad_usd_exch_bid) + s(cpi) + s(reserves_fed) + s(ppi) + s(indstr_prod2) + s(month), data = training)
par(mfcol = c(1,1))
# Due to conflicts between packages, I cannot show the results to the following two commands below. Therefore I show the plots at the beginning.
# plot(gam1, residuals = TRUE, pch =".", rugplot = FALSE)
# summary(gam1)
library(bartMachine)
## Loading required package: rJava
## Loading required package: car
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: missForest
## Loading required package: itertools
## Loading required package: iterators
## Welcome to bartMachine v1.2.0! You have 0.53GB memory available.
set_bart_machine_num_cores(parallel::detectCores())
## bartMachine now using 2 cores.
bart <- bartMachine(X = training[,-(1:2)], y = training$return)
## bartMachine initializing with 50 trees...
## Now building bartMachine for regression ...
## evaluating in sample data...done
library(gbm)
## Loading required package: survival
## Loading required package: lattice
## Loading required package: parallel
## Loaded gbm 2.1.1
boosted <- gbm(return ~ . + comm_stocks * cost + reserves * us_production - date -expiry_date, data = training, interaction.depth = 4, shrinkage = 0.001)
## Distribution not specified, assuming gaussian ...
# lm0
testing$return_lm0 <- predict(lm0, newdata = testing)
testing$squarederror_lm0 <- with(testing, (return_lm0 - return)^2)
(average_SE_lm0 <- mean(testing$squarederror_lm0))
## [1] 5.511555
# lm2
testing$return_lm2 <- predict(lm2a, newdata = testing)
## Warning in predict.lm(lm2a, newdata = testing): prediction from a rank-
## deficient fit may be misleading
testing$squarederror_lm2 <- with(testing, (return_lm2 - return)^2)
(average_SE_lm2 <- mean(testing$squarederror_lm2))
## [1] 5.530715
# lm3a_subset
testing$return_lm3_subset <- predict(lm3a_subset, newdata = testing)
testing$squarederror_lm3_subset <- with(testing, (return_lm3_subset - return)^2)
(average_SE_lm3_subset <- mean(testing$squarederror_lm3_subset))
## [1] 6.250821
# lasso
lasso_return <- predict(lasso, newx = X_test)$fit
average_SE_lasso <- colMeans( (testing$return - lasso_return) ^ 2 )
(average_SE_lasso <- min(average_SE_lasso))
## [1] 5.519184
predictions_lasso <- lasso_return[,which.min(average_SE_lasso)]
# PCR
testing$PCR_return <- predict(PCR, newdata = testing)
average_SE_PCR <- colMeans( (testing$return - testing$PCR_return) ^ 2 )
(average_SE_PCR <- min(average_SE_PCR))
## [1] 5.49972
predictions_PCR <- testing$PCR_return[ , , which.min(average_SE_PCR)]
error_PCR <- predictions_PCR - testing$return
# Bart
testing$bart_return <- predict(bart, new_data = testing[,3:31])
testing$squarederror_bart <- with(testing, (bart_return - return)^2)
(average_SE_bart <- mean(testing$squarederror_bart))
## [1] 9.552035
# boosted
testing$boosted_return <- predict(boosted, newdata = testing, type = "response", n.trees = 100)
testing$squarederror_boost <- with(testing, (boosted_return - return)^2)
(average_SE_boosted <- mean(testing$squarederror_boost))
## [1] 5.520387
# gam
testing$return_gam <- predict(gam1, newdata = testing)
testing$squarederror_gam <- with(testing, (return_gam - return)^2)
(average_SE_gam <- mean(testing$squarederror_gam))
## [1] 6.668476
# compare to a guess of returns = 0 for each day.
testing$squarederror_return0 <- with(testing, (return)^2)
(average_SSE_return0 <- mean(testing$squarederror_return0))
## [1] 5.520292
According the average SE criteria (selected as the cost function in this project), the best model is the BART machine, which greatly outranks all others.
Note that the BART model predicted the worst. This is because BART assumes independent obeservations and therefore it is not well suited for timeseries data with autocorrelation. Further, the dataset has also been modified so that there isn’t any missing values and therefore, one of the benefits of BART - dealing with missing data - is not taken advantage of here.
The full ranking is:
| Rank Number | Model Name | Average Squared Errors |
|---|---|---|
| 1 | Principal Components Regression | 5.49972 |
| 2 | Bivariate regression of return on time | 5.511555 |
| 3 | Lasso | 5.519184 |
| 4 | Guessing a return of zero | 5.520292 |
| 5 | Boosting | 5.520741 |
| 6 | Multivariate Regression with all inputs except prior price | 5.530715 |
| 7 | Subset of multivariate linear regression with all inputs (using step function) | 6.250821 |
| 8 | General Additive Model | 6.668476 |
| 9 | Bart Machine | \(9.552035\) |
library(ggplot2)
ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = predictions_PCR), colour = "green" ) + ggtitle("Principal Component Regression in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")
ggplot(data = testing, aes(x = date)) + geom_line(aes(y = error_PCR)) + ggtitle("Plot of errors in Testing Data") + ylab("Error") + xlab("Time")
ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = bart_return), colour = "green" ) + ggtitle("BART machine in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")
ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = return_lm0), colour = "green" ) + ggtitle("Bivariate Regression in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")
ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = boosted_return), colour = "green" ) + ggtitle("Boosted Model in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")
ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = return_gam), colour = "green" ) + ggtitle("GAM Model in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")
ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = return_lm2), colour = "green" ) + ggtitle("Multivariate Regression 1 in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")
ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = return_lm3_subset), colour = "green" ) + ggtitle("Multivariate Regression 2 in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")
ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = predictions_lasso), colour = "green" ) + ggtitle("Lasso in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")